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Abstract 

Background: Chained equations imputation is widely used in medical research. It uses a set of conditional models, 
so is more flexible than joint modelling imputation for the imputation of different types of variables (e.g. binary, 
ordinal or unordered categorical). However, chained equations imputation does not correspond to drawing from a 
joint distribution when the conditional models are incompatible. Concurrently with our work, other authors have 
shown the equivalence of the two imputation methods in finite samples. 

Methods: Taking a different approach, we prove, in finite samples, sufficient conditions for chained equations and 
joint modelling to yield imputations from the same predictive distribution. Further, we apply this proof in four specific 
cases and conduct a simulation study which explores the consequences when the conditional models are compatible 
but the conditions otherwise are not satisfied. 

Results: We provide an additional "non-informative margins" condition which, together with compatibility, is 
sufficient. We show that the non-informative margins condition is not satisfied, despite compatible conditional 
models, in a situation as simple as two continuous variables and one binary variable. Our simulation study 
demonstrates that as a consequence of this violation order effects can occur; that is, systematic differences depending 
upon the ordering of the variables in the chained equations algorithm. However, the order effects appear to be small, 
especially when associations between variables are weak. 

Conclusions: Since chained equations is typically used in medical research for datasets with different types of 
variables, researchers must be aware that order effects are likely to be ubiquitous, but our results suggest they may be 
small enough to be negligible. 

Keywords: Chained equations imputation, Gibbs sampling, Joint modelling imputation. Multiple imputation. 
Multivariate missing data 



Background 

Multiple imputation [1] has become a popular approach 
for the analysis of incomplete data, with several main- 
stream statistical packages now incorporating multiple 
imputation tools. It involves making several draws of the 
missing data from their posterior predictive distribution 
given the observed data and an imputation model. For 
multivariate, non-monotone missing data there are two 
main approaches for constructing an imputation model: 
joint modelling and chained equations. Joint modelling 
imputation requires the specification of a parametric joint 
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model for the complete data: current implementations 
impute under the multivariate normal model, the log lin- 
ear model and the general location model [2]. However 
for datasets containing different types of variables the 
current classes of joint models [3-5] may not be appropri- 
ate for the joint distribution of the data. The alternative 
method, chained equations imputation [4,6], is more flex- 
ible as it specifies a separate imputation model, typically 
a univariate regression model, for each incomplete vari- 
able and updates the missing data for each variable in 
turn. 

Chained equations imputation has been proposed 
under several different names including: fully conditional 
specification, stochastic relaxation, variable-by-variable 
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imputation, regression switching, sequential regressions, 
ordered pseudo-Gibbs sampler, partially incompatible 
MCMC and iterated univariate imputation [3]. In addi- 
tion to handling variables of varying types, the chained 
equations approach has other flexible features such 
as incorporating restrictions, logistical and consistency 
bounds (for example, to handle imputation of gender 
specific variables or impute only questions that were 
not intentionally skipped in a questionnaire [4]). van 
Buuren and Groothuis-Oudshoorn [7] discuss the wide 
range of medical fields that have used chained equations 
imputation (e.g. addiction [8], epidemiology [9], infec- 
tious diseases [10], genetics [11], cancer [12], obesity and 
physical activity [13]), and a brief review of available 
software that have implemented chained equations impu- 
tation is given by [14]. Given the popularity of chained 
equations, among users of varying degrees of exper- 
tise, there is now guidance in its practical use (e.g. [14] 
and [15]). 

Despite its widespread use, a known theoretical weak- 
ness of the chained equations method is that the implicit 
joint distribution underlying the separate models may 
not always exist: that is, the conditional models may be 
incompatible [4,5,16-18]. In such situations, the results 
after chained equations imputation may systematically 
differ according to the order in which the missing 
variables are updated in the chained equations algo- 
rithm. We shall refer to this phenomenon as an "order 
effect". 

Previous authors [3-5] have stated that chained 
equations imputation under a set of normal linear regres- 
sion models, with all other variables as covariates and no 
interactions, is equivalent to a Gibbs sampler that draws 
from a multivariate normal distribution, van Buuren [3] 
also states for a dataset of three partially observed binary 
variables that chained equations under a set of logistic 
regression models, with all other variables included as 
main effects only, is equivalent to a joint modelling impu- 
tation under a log linear model with the three-way factor 
term set to zero. However, none of these papers provides 
a proof beyond stating that the set of conditional models 
is compatible [16] and are all derived from the specified 
joint distribution. 

Independently and concurrently with our work, Liu 
et al. [19] have given sufficient conditions (which include 
compatibility of the conditionals) under which, as the 
sample size tends to infinity, the stationary distribu- 
tion of the Markov chain generated by the chained 
equations algorithm (assuming that this stationary distri- 
bution exists and that the chain converges to it) converges 
to the posterior predictive distribution of the missing data 
implied by a joint Bayesian model. That is, under these 
sufficient conditions, the total variation of the distance 
between the chained equations stationary distribution and 



the posterior predictive distribution tends to zero as the 
sample size tends to infinity. As a corollary, Liu et al 
show the equivalence of the two imputation methods in 
finite samples under a condition we have independently 
identified and named the "non-informative margins" 
condition. 

Our work is complementary to that of Liu et al. Firstly, 
we have taken a different approach to prove the equiv- 
alence of the two imputation methods in finite samples. 
Additionally, in specific examples, we prove whether the 
non-informative margins condition is satisfied or not, and 
in a simulation study we demonstrate the consequences 
when the conditional models are compatible but do not 
satisfy the non-informative margins condition. 

In this paper, we provide a "non-informative margins" 
condition that, together with compatibility of the condi- 
tionals (and assuming that the Markov chain generated 
by the chained equations converges to a stationary dis- 
tribution), guarantees that the imputed values obtained 
using chained equations (at convergence) are drawn from 
the posterior predictive distribution of the missing data 
implied by a Bayesian joint model. We give examples 
of chained equations algorithms that satisfy the non- 
informative margins condition when the joint model is 
the multivariate normal model and the saturated multi- 
nomial model, and examples where this condition is 
not satisfied when the joint model is an unsaturated 
multinomial model and the general location model. A 
simulation study considers a simple chained equations 
algorithm in which the conditional models are compatible 
but do not satisfy the non-informative margins condition, 
and shows that it is not equivalent to any joint model 
procedure. 

Methods 

Notation 

Suppose K random variables X = (Xi, . . . ,X]()'^ are 
intended to be observed on N subjects. We use subscripts 
i and / to index subjects and variables respectively (/ = 
l,...,N; j = 1,. . .,K). Let x = {xij) denote an {N x K) 
matrix, whose i,j element is Xij. Column / of matrix x is 
denoted by = {xij, . . . ,XNj)'^ .His assumed that the rows 
of matrix x are independent and identically distributed 
draws from a probability distribution with probability 
distribution function p{X \ 0), where 6 is an unknown 
parameter. 

In practice some subjects have missing observations on 

up to — 1 variables and we write Xj = (x°^^, ^™") for 

axiyi,x = (xo^'.x^'') andpix | 0) = pix^^'.x^'' | 0), 
with superscript obs and mis denoting the observed and 
missing data respectively. In keeping with the assump- 
tions of joint modelling imputation and chained equations 
imputation, the missing data mechanism is assumed to 
be ignorable for Bayesian inference [20] p. 120, so that 
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inferences about 9 can be based on the marginal observed 
data posterior p(9 | x"^^). 

Joint modelling Imputation 

Joint modelling imputation requires the specification of a 
parametric joint model p{x°^^, | 9) for the complete 
data and a prior distribution p{0) for parameter 9. Imputa- 
tions are independent draws from the posterior predictive 
distribution of the missing data given the observed data 
fix""'' I x"''') [2] p. 105, which under the ignorability 
assumption is 

p {x""'' I x"'"') = jp (^'' I x"'", 9)p(9\ x"'") d9. 

Therefore, to draw from this posterior predictive distri- 
bution, first draw 9* ~ p{9 \ x"'"') foUowed by x'"'"* ~ 
pix""'' I x''''\9*) [2] p. 105. When it is difficult to draw 
from the observed data posterior p{9 \ x°^^), Markov 
chain Monte Carlo methods can be used. For example, the 
data augmentation algorithm of Tanner and Wong [21] 
draws missing values from the posterior predictive dis- 
tribution x""''* ~ pix""'' I x"^', 9*) and then draws 9 
from the complete data posterior 9* ~ p(9 \ x°^\x""-^*), 
where * denotes the last drawn values of 9 or x'"^'^. Upon 
convergence this produces a draw from the joint posterior 
distribution /7(6>,;«;'"" | x°^% 

Chained equations Imputation 

For every incomplete variable the chained equations 
algorithm requires an imputation model, typically a uni- 
variate regression model, and an accompanying prior 
distribution for the model's parameter Let X_y = 
(Xi, . . . , Xj-i, Xj+i, . . ., Xx)^ denote the vector of random 
variables excluding variable Xj and x-j = (x^J'px'^j^) 
the submatrix of x corresponding to variables X-j. We 
write p{Xj | X-j, for the probability distribution func- 
tion of the imputation model for variable Xj and piifj) 
for the prior distribution of the unknown parameter 

iff. 

Chained equations draws the imputations using an iter- 
ative algorithm, typically with 10 to 20 iterations [15]. 
To start off, the missing values of each incomplete vari- 
able are replaced by its mean or a random sample of its 

observed values. Suppose, without loss of generality, that 
variables Xi, . . .,Xg (R < K) are incomplete and variables 
Xg+i, . . .,Xk are fully observed. Given the imputations 
from the last iteration {x^^ ^\...,x^^ ^'), iteration t of 
the chained equations algorithm consists of the following 
draws [18] 

Vff ^ ~ p{iri)pixf' I 4'"^'. 4*"^'. • ■ • .x^r'^Kxr+i, ...,xk, fij 

X-^ p ^^-j^ ' I ^2 >'^3 ; • • • f^Ji f^R-\-l> • • • f^K* V^i ^ 



~ p(if2)p (xf^ I x^l\x^^ ^\ ...,Xg ^\xr+i, . . . ,xk, V^a) 

^2 ^ P \ X-^ tX^ t • • • t Xr t XR-\-\ t • • • t XKt ^2 ^ 

fu^ ~ PifR)P (x°J" I 4*''4''' • • • .4-l'*«+l' • • • '^K, fuj 
mis(t) ^ ( mis i (0 («) (f) i (t)\ 

^R ^ P \^R \ '^2 > ' • • ' ^R—l'^R+i^' • • • ' ^K> YR J' 

During each iteration the following two steps are applied 
to each incomplete variable Xj in turn: x/r* is drawn from 

the posterior distribution proportional to p(t{fj)p(xj^^ \ 
x*_j, tjfj) and missing values :«™"* are drawn from the pre- 
dictive posterior p{x1^^^ \ x*_j, rlrp. The imputations from 
the last iteration form the imputed dataset. The whole 
iterative algorithm is repeated to obtain further imputed 
datasets. 

Equivalence of joint modelling and chained equations 
imputation 

We investigated, in finite samples, sufficient conditions 
under which a chained equations algorithm with com- 
patible conditional models imputes missing data from 
the predictive distribution of the missing data implied 
by the joint model and its accompanying prior. We pro- 
vide examples of chained equations algorithms (with 
compatible conditional models) where our identified 
condition is satisfied and examples where it is not 
satisfied. 

Simulation study 

We conducted a simulation study to explore the con- 
sequences for chained equations imputation when the 
conditional models were compatible with the same joint 
model but the non-informative margins condition of 
Proposition 1 was not satisfied. In particular, we looked 
for evidence of "order effects", where the distribution from 
which the final imputed values of the variables were drawn 
differed according to the order in which the variables were 
updated in the chained equations sampler. If the chained 
equations algorithm imputes all variables from the predic- 
tive distribution of the missing data implied by a specific 
joint model, then order effects cannot occur [22]. Thus, 
the existence of order effects implies that the chained 
equations algorithm is not equivalent to imputing from 
any joint model. 

The simulation study was based on a general location 
model, discussed in the Theoretical results section below, 
with one incomplete binary variable Y and two continu- 
ous variables Wi and W2, where Wi was also incompletely 
observed. We compared joint modelling imputation under 
the general location model, considered as a gold standard. 
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with the chained equations algorithm that imputes the 
binary variable Y under a logistic regression model and 
the continuous variable Wi under a normal linear regres- 
sion model. 

We generated 500 datasets, each with a sample size of 
100. For each dataset, the rows were independent, identi- 
cally distributed realizations of the general location model 
Y - Bernoulli(3/10), Wi \ Y ^ N{10 + PY,9) and 
W2 I Wi,Y Ni9 + 8/9 + l/9Wi + PY,8 + 8/9). 
The data model was a simplified version of data that 
can occur in the medical literature [23]. The simulation 
study was repeated when (i, the regression coefficient for 
covariate Y, was set to 1 and 3. The analysis of interest 
was the normal linear regression of W2 on Wi and Y. 
To ensure that any observed order effects could only be 
due to the failure of the non-informative margins con- 
dition we considered the simplest setting, that of data 
missing completely at random [20] p. 16, and set the 
values of Y and Wi to be missing for the first 50 indi- 
viduals in the dataset. Below we describe the joint mod- 
elling imputation procedure and the chained equations 
algorithm that were separately applied to the same 500 
datasets. 

We used the data augmentation algorithm (as described 
under the heading "Joint modelling imputation") to per- 
form joint modelling imputation under the general loca- 
tion model and the joint prior given in the general location 
example (see example 4 of the Results), setting hyper- 
parameters T = V = 1/2 and k = 3/2. The number of 
imputed datasets generated, the burn-in period and the 
number of iterations between imputed datasets was 100. 
The analysis model was applied to each dataset separately 
and the mean of the multiple estimates of fi, the coefficient 
for Y, was calculated. 

In the (standard) chained equations algorithm, a logistic 
regression model for Y given Wi and W2 was first fitted to 
those rows of the dataset in which Y was observed. Let 
denote the maximum likelihood estimate of the parame- 
ters of this model and V denote its associated estimated 
variance-covariance matrix. A draw xlrp was then made 
from the multivariate normal approximation N{ijrY, V) 
and used to impute the missing Y values. The continu- 
ous variable Wi was imputed using the linear regression 
model Wi \ r, W2 ~ N{X + ^Y + (pWj.m) and prior 
distribution p(X, (/>, co) oc a>~^^^. 

To start off the chained equations algorithm the miss- 
ing values of Y and Wi were replaced with a ran- 
dom sample of their observed values. We augmented 
the chained equations algorithm such that, within each 
iteration we fitted the analysis model immediately after 
updating the binary variable Y and also immediately after 
updating the continuous variable Wi. The simulation 
study focused on systematic differences between the two 
resulting estimates of fi. Given the imputations from the 



last iteration (y ', ), iteration t of the augmented 

chained equations algorithm consisted of the following 
steps: 

1. Generate 7*^' by imputing values for the missing 
binary observations, conditioning on w'f~^^ and W2. 

1. Linearly regress 14/2 on w^* and j^^-* and store the 
estimate for the coefficient of Y, denoted by /i^. 

3. Generate wf* by imputing values for the missing 
continuous observations, conditioning on j*'^ and MI2. 

4. Linearly regress 14/2 on wj*^ and j'^*-' and store the 
estimate for the coefficient of Y, denoted by fi'^. 

The chained equations algorithm was implemented 
with 10010 iterations. The first 10 iterations were 
regarded as burn-in and the estimates from these iter- 
ations discarded. The remaining 10000 estimates of /J^ 
were averaged, and likewise for We denote these 
means as and and their difference by - p". The 
quantity ji^ — ji'^ can be interpreted as an estimate of the 
order effect for imputation in one dataset. We estimated 
the (Monte Carlo) standard error of — fi'^ using the 
batch-means method, a method for computing standard 
errors for correlated output [24] p. 124, and calculated a 
95% confidence interval from this. 

Linear discriminant analysis is an alternative way 
to estimate a logistic regression [25,26]. A modified 
chained equations algorithm using linear discriminant 
analysis on all individuals with observed Y has been 
proposed as an alternative way to impute the binary 
variable Y [27]. Because the linear discriminant likeli- 
hood is the joint distribution of Y, Wi and W2, this 
model has the advantage of recovering some informa- 
tion about ify in the W margin. We repeated the 
simulation study using this modified chained equations 
algorithm. 

We assessed the sensitivity of our results by repeat- 
ing the simulation study using different specifications. 
For joint modelling imputation we increased the num- 
ber of imputed datasets generated, the burn-in period 
and the number of iterations between imputed datasets 
to 250. For the standard and modified chained equations 
procedures we (1) increased the burn-in period of the 
chained equations sampler to 1000 iterations and (2) sam- 
pled every 50*'' iteration thereby reducing serial correla- 
tion (with a burn-in period of 10 iterations). To check 
that our results were not dependent upon our choice of 
prior distributions we repeated the simulation study with 
improper imputation procedures; that is, using maximum 
likelihood estimates of i/f, instead of Bayesian draws xj/j 
from its posterior distribution p{ifj)p{Xj^^\x'^j, ifj). Lastly, 
we also repeated the simulation study with a sample size 
of 1000 observations. 
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Results 

Theoretical results 

Equivalence of joint modelling and chained equations 
imputation 

In this section, we give our key result Proposition 1, 
which shows that, in finite samples, compatibility of the 
conditionals and our proposed non-informative margins 
condition are sufficient for chained equations and joint 
modelling to yield imputations from the same predictive 
distribution. 

Consider a joint model p(x | 9) and prior p(9). From 
here onwards we shall use /?(. | .) to refer to any probability 
distribution derived from this joint model. In particular, 
for each J = 1, . . .,R, p(Xj \ X-j, 0) is the conditional dis- 
tribution of Xj given X-j and 9 implied by the joint model, 
and p(X-j I 9) is the conditional distribution ofx-j given 
9. The distribution of:* given 9 factorizes as 

p(x I 9) =p{xj I x-j,e)p{x-j I 6*). 

Let tjrj and tjfj be functions of 9 such that p(xj \ X-j, 9) = 
p{Xj I X-j, i/j) and p{x-j \ 9) = p{x-j \ ^jfj). That is, the dis- 
tribution of Xj given x-j and the distribution of x-j depend 
on 9 only through the functions i/fy and i/fy, respectively, of 
9. Let piirj, -tj/j) denote the joint prior for i/fy and ^j implied 
by p{9), and let p{:^j) and p(^j) denote the corresponding 
marginal priors. 

The chained equations algorithm applies the following 
two steps for each Xj in turn: 

Step CEl Draw from the distribution proportional to 

p{^j)p{.xf I X*_j, ^j). 

Step CE2 Drzwx^''* frompixf' | x*_j, fp. 

The choice of the parameterizations i/fy and i/f, does not 
affect the output of step CE2, but a parsimonious choice 
will help to make the condition of Proposition 1 hold. 

Proposition 1. Upon convergence, the chained 
equations algorithm defined by CEl and CE2 and the 

joint model pix \ 9) and prior p{9) draw from the same 
predictive distribution of ^v*"" if, for each incomplete 
variable Xj, piifj, ifj) = p(if/)p(i^j)> i-e. if the joint prior 
distribution for i/fy and ifj factorizes into independent 
priors for and xffj. 

Proof. Using the condition of Proposition 1, 

P (fj, fj I xf,x*_j) a p (fj, fi) p [xf\x*_j I i^j, f,) 

= piifj)pifj)p (xf' I X*_j,f^p{x*_j I 



Now, integrating out t/f,-, 

p (fj I ^^*;) <xp(^j)p (xf' I x*_j, ifj) 

Therefore, step CEl yields a draw from p(irj \ xj^^, x*_j). 

Next, 1^'' and xf' are conditionally independent given 
x*_^ and if* so p(xf' \ x*_j,f*) = p(xf' \ xf',xtj,f*) 
and step CE2 yields a draw from />(;cj"" | x'j^^,x*_j, fp. 

Hence steps CEl and CE2 together yield a draw from 
piifj, I Xj^^x^j), and in particular they draw xj"^* 
from p(xj"^ I x'j^^,x'^j). The latter is a full-conditional dis- 
tribution corresponding to the joint density p{x) implied 
by the joint model. Once xj"^ has been sampled, irp 
the sampled value of ifj, is not used again in the chained 
equations algorithm. So, the application of steps CEl and 
CE2 to each ; in turn and then iterating is a Gibbs sampler 
which, at convergence, yields a draw from p(x""^ \ x°^^), 
the predictive distribution implied by the joint model and 
its accompanying prior. 

Comment 1. The condition of Proposition 1 does not 
hold if the conditional and marginal parameters ifj and ifj 

are not distinct (i.e. if their joint parameter space is not the 
product of their separate parameter spaces), and in partic- 
ular if the combined dimension of ifj and ifj is greater than 
that of 9. Distinctness of parameter spaces is a property 
of the model p(x \ 9) and not of the prior p(9). It will be 
used in the examples of the unsaturated multinomial dis- 
tribution and the general location model to identify joint 
models where the condition of Proposition 1 does not 
hold. 

Comment 2. Heuristically, the condition of Proposition 1 
says that there is no information about 1/9 in the marginal 
distribution of X-j, so we call it the "non-informative mar- 
gins" condition. When such information does exist, it is 
used by the joint modelling sampler but not by the chained 
equations algorithm, so they may draw from different dis- 
tributions. Our simulation study will show that this occurs 
in an example by demonstrating order effects. 

Comment 3. As the non-informative margins condition 
has only been shown to be sufficient, then potentially 
if this condition is not satisfied the chained equations 
algorithm defined by CEl and CE2 and the joint model 
pix I 9) and prior pi9) could still draw from the same 
predictive distribution of x'"'*. 

Comment 4. This proof holds for improper prior distri- 
butions provided the posterior distributions are proper. 

Comment 5. When the non-informative margins condi- 
tion holds true the chained equations algorithm is a Gibbs 
sampler, and so order effects cannot occur [22]. 
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Example 1 : Multivariate normal 

Consider the multivariate normal joint model X ~ 
N(iu., E) with parameter ^ = E) and prior distribution 
p{0) oc lEp*^ (k e Q). We show that the corresponding 
chained equations algorithm, which imputes under a set 
of normal linear regression models, satisfies the non- 
informative margins condition of Proposition 1 (and 
hence draws from the same joint model as joint modelling 
imputation). 

For each j we partition the mean vector fi as (/jlj, jlj)^ 
and the covariance matrix E as 




-Si Sy_ 

such that Xj ~ Nifj,j, Oj) and X-j ~ N{jlj, Ey). The con- 
ditional distribution of Xj given X-j is the normal linear 
regression model Xj \ X-j ~ N{aj + fiJX-j, coj) where 

fij = dj = f^j - fijiij and Wj = cxj - s/^f^Sj P] 

p. 157. Using our notation for chained equations imputa- 
tion (see under the subsection "Chained equations impu- 
tation" of the Methods section) ij/j = (uj, fij, coj) and i/f,- = 

The joint prior for ^7 and i/f, derived from p{6) is 
p{\l/j, fj) = lEp" X |Ey| [2] p. 158-159. So, using a stan- 
dard result from matrix algebra for the determinant of a 
partitioned matrix, 

pifj, fj) = (aj - gJt-\^~' X \tj\-' X \tj\ 
= «r- X |Eyr(^-" 
= p{fj) xp{if(j)). 

Therefore, the non-informative margins condition of 
Proposition 1 is satisfied. 

Example 2: Saturated multinomial distribution 
We now show for joint modelling imputation under a sat- 
urated multinomial model and a Dirichlet prior for 9, that 
the corresponding chained equations algorithm satisfies 
the non-informative margins condition of Proposition 1. 

Consider K categorical variables X = (Xi, . . . ,Xk), 
where each Xj takes possible values 1, ...,/,(/ = 1, . . .,K). 
Variables X define a iC-way contingency table. Let c = 
(ci, . . . , ck) denote a generic cell of the contingency table, 
9c denote the cell probability pr(X = c) and 9 denote the 
set of all cells of the contingency table. The joint distri- 
bution of X is a multinomial distribution with parameter 
9 = {9c : c & d) and index equal to 1. 

Summing the table counts over variable Xj produces 
a collapsed contingency table defined by variables X-j. 
Let C-j = (ci, . . . , Cj-i, Cj+i, . . ., c/c) denote a generic cell 
of the collapsed table, i^c-; denote the cell probability 
pr(X-j = C-j) and dj denote the set of all cells of the 



collapsed table. The marginal distribution of X-j is multi- 
nomial with parameter i/fy = {iAc_; : C-j € dj], where 
^jfc-j = Yl\=i ^c- The conditional distribution of Xj given 
X-j = C-j is the multinomial distribution with parameters 
ifc-j = [priXj = Cj I X-j = c-j) : Cj = 1, . . So, the 
full set of parameters for the conditional distribution oiXj 
given X_y is i/fy = {ijrc-j ■ c-j edj}. 

If the prior distribution for 9 is Dirichlet with hyper- 
parameter a = {uc : c G 9}, then the implied prior 
distributions for {frj and xjfj are independent: the prior for 
irj is Dirichlet with hyperparameter a,- = {occ_j : c-j e dj], 

where ac_j = X!c =i "^ci and the prior distribution for i/fy 
is the product of the set of independent Dirichlet distri- 
butions {ifc-j ~ D(ac_j) : C-j e 9,}, where ac_f = [uc '■ 
Cj = 1, . . . ,Ij} is a subset of a [2] p. 256. Since the prior 
for -ij/j) can be factored into independent distributions 
for and ^j, the non-informative margins condition of 
Proposition 1 is satisfied. 

Example 3: Unsaturated multinomial distribution 

When the joint model is an unsaturated multinomial 
model, we give an example where the conditional and 
marginal parameters 1/9 and V) are not distinct (see 
comment 1 of Proposition 1). Consequently the non- 
informative margins condition of Proposition 1 is not 
satisfied. 

Consider K categorical variables X = (Xi, . ..,Xk) as 
described in the saturated multinomial example. Assume 
that all cell probabilities are positive, 9(c) > 0 for all c, to 
ensure that every multinomial distribution considered can 
be written as a log linear model and that all possible con- 
ditional distributions exist [28] p. 202. Let the joint model 
be the hierarchical log linear model that contains all two- 
way factors between the K variables and no higher order 
factors . We shall refer to this as the all two-way factor hier- 
archical model. Under this model, for any / the conditional 
distribution oiXj given X-j follows a multinomial logistic 
regression model (or a logistic regression model when Xj 
is binary) where the regression model includes variables 
X-j as main effects only (i.e. no interaction terms). 

In generating class notation [29] the all two-way fac- 
tor hierarchical model is written as [1 2] ... [1 K] . . . 
[{K — 1) K]. Any hierarchical log linear model for X can 
be represented by an undirected graph in which the graph 
vertices are the variables Xi, . . .,Xk and two vertices Xg 
and Xfi are connected by an edge if and only if the log 
linear model contains two-factor or higher order terms 
involving variables Xg and Xf,. Any model containing all 
two-factor terms is therefore represented by the complete 
graph. Asmussen and Edwards [29] state that two vertices 
in a graph are adjacent if there is an edge between them. 
For any subset a of the vertices, Asmussen and Edwards 
define the boundary of a to be the set of vertices that are 
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not in a but that are adjacent to one or more vertices in 
a. So, for the complete graph, the boundary of Xj is X-j. 
Theorem 2.3 of [29] states that a hierarchical log linear 
model is collapsible onto X^j if and only if the bound- 
ary of Xj is contained in a generator of the hierarchical 
log linear model. Further, Theorem 4.1 of [29] states that 
if a hierarchical log linear model is not collapsible onto 
X-j, then parameters i/f/ and i/iy are not distinct. For any 
X-j, the boundary of in the complete graph, contains 
all of the remaining K — I variables. When K > 4, X-j is 
not contained in any of the generators, [1 2] . . . [1 /C] . . . 
[(K — 1) K], of the all two-way factor hierarchical model, 
and hence the log linear model is not collapsible onto 
Therefore, parameters \j/j and are not distinct, and so 
the non-informative margins condition of Proposition 1 is 
not satisfied when K >4. 

Example 4: General location model 

We give an example of a chained equations algorithm 
derived from joint modelling under a general location 
model that does not satisfy the non-informative margins 
condition of Proposition 1. Our simulation study is based 
on this example. 

Suppose that the data on each individual consist of one 
incomplete binary variable Y and K — 1 continuous vari- 
ables W = {Wi, . . . , Wk-i)^, where one or more of the 
continuous variables are also incomplete. Let the joint dis- 
tribution of X = {Y, W)^ be the general location model 
Y ~ Bernoulliiy) and W | r ~ Ni/xo + i^iY, E), for 
unknown parameters 9 = (y, /u-o; Mi; [30]. Let the joint 
prior for 0 be piO) = y^-^l - yY''^ | E with 
hyperparameters t, v > 0 and k € Q. 

From the multivariate normal example above it is 
straightforward to show that the non-informative margins 
condition of Proposition 1 holds for imputing any Wj. 

The conditional distribution of Y given W, p(Y \ W, 
^y)> is the logistic regression model with covariates W 
[25]. The marginal distribution of W, p{W \ i^y), can be 
written as a mixture of normal distributions 

p{W I fY) = ypiW \Y=1, MO, Ml, S) 

+ il-y)p(W\ Y = 0,no,'£). 

This cannot be parameterized more parsimoniously 
than i^r = (y, /u-o, l^i> = As ify is a function of 6, it 
is determined hyO = ^y- 

Consequently, given i/fy the parameters of the logis- 
tic regression model tp^y are fully determined, and so ^y 
and are not distinct. Therefore, as discussed in com- 
ment 1 above, the non-informative margins condition of 
Proposition 1 does not hold. 

Using the same argument, the non-informative margins 
condition of Proposition 1 does not hold for a chained 
equations algorithm derived from joint modelling under 



the restricted general location model, with cell probabil- 
ities restricted by the all two-factor hierarchical model 
(discussed in the unsaturated multinomial example) and 
cell means restricted to be a linear function of the categor- 
ical variables. 

Simulation study results 

This section reports the results of the simulation study, 
where the chained equations conditional models were 

compatible with the same joint model but the non- 
informative margins condition of Proposition 1 was not 
satisfied. 

Figures la and lb show, for the first 30 of the 500 
datasets, the value of — fi'^ (estimate of the order effect 
in one dataset) along with its 95% confidence interval, for 
the (standard) chained equations procedure (i.e., binary 
variable Y imputed under the logistic regression model). 
In a number of the datasets the 95% confidence inter- 
vals did not cross zero. Thus, there was clear evidence of 
order effects, with the magnitude of such effects varying 
between datasets. Such statistically significant evidence of 
an order effect occurred in 164 and 386 of the 500 datasets 
for yS = 1 and p = 3 respectively. The range of absolute 
values of — P'^ was larger for p — 3 than for P = 1. 
These results confirm that the chained equations proce- 
dure was not equivalent to any joint model procedure. 

The average magnitudes of the order effects p^, P'^ and 
\P^ — P'^l over the 500 datasets are shown in Table 1. 
Consider the results for the chained equations algorithm 
(labelled LR). In keeping with Figure 1, the average mag- 
nitude of the order effect was larger for p = 3 than 
P = 1. The means of p^ and P'^ did not differ system- 
atically, consistent with the direction of the order effect 
being arbitrary and dataset dependent. Estimates P^ and 
P"^ appeared to be equally good estimates of p. 

The forest plots oip^ — p'^ with 95% confidence intervals 
corresponding to the modified chained equations algo- 
rithm (i.e Y imputed under the linear discriminant model) 
are shown in Figures Ic and Id. Order effects were smaller 
than in Figures la and b, but were still present because 
the modified algorithm did not use information about Wi 
and W2 when Y was missing. Out of the 500 datasets the 
number of datasets that showed statistically significant 
evidence of an order effect was 113 for ;6 = 1 and 330 for 
p = 3. 

From Table 1, the complete case estimates of p were 
unbiased. When P = 3 the linear discriminant analysis 
values p'^ and P^, and the joint modelling imputation esti- 
mate PjM were slightly biased towards the null. This bias 
was due to the prior (which was the same for imputation 
under the linear discriminant analysis model and joint 
modelling imputation under the general location model) 
and it disappeared when the sample size was increased 
to 1000 observations and when the simulation study was 
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(a) logistic regression; /? = 1 




-0,15 0 0-15 -0-15 0 0-15 

(c) linear discriminant analysis; /3 = 1 (d) linear discriminant analysis; /? = 3 

Figure 1 Four forest plots of the posterior mean differences /S* - Each panel is a forest plot of — p'- for the first 30 datasets, 95% 
confidence intervals calculated using the IVlonte Carlo standard error. Panels (a) and (b) correspond to binary variable Y imputed under the logistic 
regression model, with /i = 1 and /i = 3 respectively. Panels (c) and (d) correspond to / imputed under the linear discriminant model, with fi = 1 
and (6 = 3 respectively. 



Table 1 Over 500 datasets, average of the complete case estimates, joint modelling imputation estimates and values of 

jS'^ and — P'\ for the chained equations algorithm and the modified chained equations algorithm, with confidence 
intervals [mean ± 1.96 x (standard deviation -i-^500)] 
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b complete case analysis estimate of ^ from the observed data only; 

U estimate of ^ from joint modelling imputation; 

V logistic regression; 

II linear discriminant analysis. 
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conducted using improper imputation; i.e using maxi- 
mum likelihood estimates instead of Bayesian draws of 
parameters. For joint modelling imputation, and the stan- 
dard and modified chained equations procedures chang- 
ing their specifications (e.g., larger number of iterations, 
burn-in period) gave the same pattern of results as above 
(results not shown but available on request from the 
authors). 

In preliminary simulation studies, when the non- 
informative margins condition was satisfied the results 
were consistent with a zero order effect (results not shown 
but available on request from the authors). 

Discussion 

We have defined a non-informative margins condition 
which, together with compatibility of the conditional 
models, we have proved is sufficient for a chained 
equations algorithm to impute missing data from the pre- 
dictive distribution of the missing data implied by the joint 
model and its prior distribution. Also, we have shown 
that compatibility of the conditional models is not alone 
a sufficient condition. In a scenario where the condi- 
tionals models were compatible but the non-informative 
margins condition failed, our simulation study showed 
that the distribution from which the final imputed val- 
ues of the variables were drawn differed, in a dataset- 
dependent manner, according to the order in which 
the variables were updated in the chained equations 
sampler. 

In work that is complementary to the finite-sample 
results presented in this paper, Liu et al. [19] identified suf- 
ficient conditions for chained equations imputation and 
imputation under a fully Bayesian model to be asymp- 
totically equivalent; that is, for the supremum of the 
difference between the two imputation distributions to 
converge to zero as the sample size tends to infinity. This 
implies that when the non-informative margins condition 
is not satisfied but the conditional models are compati- 
ble with the same joint model, the order effects identified 
in our simulation study will disappear as the sample size 
tends to infinity. 

In our simulation study the average magnitude of the 
order effects was small and did not induce bias. Given that 
chained equations imputation is a widely used approach to 
imputation, these results are somewhat reassuring. How- 
ever, the scope of these simulations was limited and it 
remains possible that chained equations imputation could 
lead to more substantial bias in different situations; for 
example, when there are many partially observed vari- 
ables. 

When the non-informative margins condition does not 
hold we expect some loss of efficiency in general (because 
some information is discarded). However, in our simula- 
tion study we did not detect any sizable loss of efficiency. 



The issue of variance estimation for chained equations 
imputation is beyond the scope of this paper. 

The advantage of chained equations imputation is that 
we do not need to specify the joint distribution of the vari- 
ables. In cases where it is not known that there is a joint 
distribution, several methods for checking compatibility 
have been proposed (e.g., [16,31-35]). In practice, these 
methods are either limited to discrete distributions or are 
difficult to apply for multivariate distributions of more 
than 2 or 3 dimensions. This means that it may not be pos- 
sible to check that the conditionals are compatible with 
the same joint model or that our non-informative mar- 
gins condition holds true, van Buuren and other authors 
[3,18], in the examples they considered, concluded that 
chained equations is a robust approach even when the 
set of conditionals are not compatible with the same 
joint model. The findings of our simulation study sup- 
port this body of work. Other studies [3,4,36,37] have 
compared chained equations and joint modelling, when 
missingness is multivariate, nonmonotone and ignorable, 
in settings which reflect real data (e.g. mixture of different 
types of variables, non-linear relationships and interac- 
tions between variables, semi-continuous variables). None 
of these studies has reported substantial differences in the 
performances of joint modelling imputation and chained 
equations imputation. Nonetheless many authors empha- 
size the need for further understanding of the theoretical 
underpinnings of the chained equations approach and the 
establishment of the robustness of the chained equations 
method (e.g. [3,7,14,38]). 

Conclusions 

In finite samples, compatibility of the conditionals and 
our non-informative margins condition are sufficient for 
chained equations and joint modelling to yield imputa- 
tions from the same predictive distribution. Furthermore, 
our simulation study demonstrated that, even in a simple 
setting, a chained equations procedure that does not sat- 
isfy the non-informative margins condition is not neces- 
sarily equivalent to a joint model procedure, even though 
when its conditional models are compatible. 

When conditionals are incompatible or the non- 
informative margins condition is not satisfied, the dis- 
tribution from which the imputed values are drawn can 
differ according to the order in which the variables are 
updated in the chained equations sampler, thereby intro- 
ducing order effects. 

Given the widespread use of chained equations imputa- 
tion in medical research for datasets with different types 
of variables, researchers must be aware that order effects 
are likely to be ubiquitous. As noted by van Buuren [3], 
further work is needed to verify the robustness of chained 
equations to incompatibility of the conditional models in 
more general and realistic settings. Equally, future work 
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could evaluate the robustness of chained equations impu- 
tation when the sample size is small-to-moderate, the 
conditionals are compatible and the non-informative mar- 
gins condition is not satisfied. 
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